Louis Othen - S21002027
import pandas as pd # Used for data manipulation/analysis
import pandas_profiling as pf # EDA tool for pandas dataframes
import os # For file-system operations
import plotly.express as px # For data visualisation
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbours algorithm
from sklearn.linear_model import LogisticRegression # Logistic Regression algorithm
from sklearn.svm import SVC # Support Vector Machine algorithm
from sklearn.preprocessing import StandardScaler # Standardises values so features are comparable/less affected by outliers
from sklearn.model_selection import train_test_split # Splits datasets ready for modelling
from sklearn.metrics import classification_report # Per-class precision/recall/F1 summary
from sklearn.metrics import confusion_matrix # Confusion matrix of model prediction results
from sklearn.metrics import accuracy_score # Accuracy score of model(s)
from sklearn.metrics import roc_auc_score, auc, roc_curve # ROC/AUC metrics of model(s)
Now that all relevant libraries have been imported into the notebook, the next step is to load the Titanic datasets downloaded from Kaggle into pandas dataframes.
# Go to working directory
#-----------------------------------------------------------------
folder_path = 'C:\\Users\\lothe\\OneDrive\\Wrexham Uni (Masters)\\CONL708 - Machine Learning\\Summative Assignments\\titanic'
os.chdir(folder_path)
# Load in titanic data
#------------------------------------------------------------------
titanic_data = pd.read_csv('train.csv')
new_titanic = pd.read_csv('test.csv')
# Preview train dataset
#------------------------------------------------------------------
display(titanic_data.head())
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Preview test dataset
#------------------------------------------------------------------
display(new_titanic.head())
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
pf.ProfileReport(titanic_data)
First, the target variable (Survived) needs to be separated into its own variable (Y) for later use in the modelling phase, away from the independent variables, which are kept in a separate dataframe.
titan_train = titanic_data.copy() # Take a copy of titanic data, to preserve the original intact
titan_train = titan_train[[
'PassengerId'
,'Pclass'
,'Name'
,'Sex'
,'Age'
,'SibSp'
,'Parch'
,'Ticket'
,'Fare'
,'Cabin'
,'Embarked'
]]
Y = titanic_data['Survived'] # To store training label
Now that the training and test versions of the Titanic dataset have been successfully loaded, the next stage can begin: transforming the data so that it is ready for modelling. These steps include handling missing values, placing numerical values onto the same scale, removing unnecessary columns, and so forth.
From an initial glance, there appear to be some columns in both datasets that cannot be used, particularly Name, Ticket, and PassengerId. An argument could be made that Name shows indicators of survival, such as Doctor or Reverend, but it may introduce bias at this stage, so it will still be removed.
# Removal of columns
#---------------------------------------------------------------------
titan_train.drop(columns = ['PassengerId','Ticket','Name'], inplace = True)
The next attribute to deal with is Sex; this appears to be a potentially important feature for our modelling purposes, but cannot be used in its current format. Therefore one-hot encoding can be employed to convert it into a numeric binary value. The new values will show as 1 (male) and 0 (female).
# Conversion of Sex column with one-hot encoding
#---------------------------------------------------------------------
sex_dummy_tr = pd.get_dummies(titan_train.Sex) # One-hot encoding for training Data
titan_train['Gender'] = sex_dummy_tr.male # Add converted column back to training data
titan_train.drop(columns = 'Sex', inplace = True) # Remove older column from training data
Now to focus on the Cabin attribute. The EDA report shows a mixture of passengers recorded as having a cabin and those who were not. For this reason, it seems inadvisable to remove the data, as it could prove to be an important feature the models may consider as to who would survive the disaster. With that in mind, instead of removing any data here, the aim is to convert it to a binary value: having a cabin (1) or not (0). The first step is to convert any NULL values into a 0, and observations with cabin numbers assigned to 1.
titan_train['Cabin'].fillna(0,inplace = True) # For any Cabin values that are NULL/NaN, replace with 0
titan_train['Cabin'] = titan_train['Cabin'].apply(lambda x: 1 if x != 0 else x) # If the Cabin value is not 0, then assign it as 1
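The same fillna-then-apply mapping can also be expressed in one step with `notna()`. A minimal sketch on a hypothetical mini-sample mirroring the Cabin column:

```python
import pandas as pd

# Hypothetical sample mirroring the Cabin column: strings where a cabin
# was recorded, None where it was not.
cabins = pd.Series(["C85", None, "E46", None])

# notna() marks assigned cabins True; astype(int) maps True/False to 1/0.
has_cabin = cabins.notna().astype(int)
print(has_cabin.tolist())  # [1, 0, 1, 0]
```

Both approaches produce the same binary feature; the one-liner just avoids the intermediate sentinel value of 0.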
As can be seen from the data, there are values missing from the Age column. Rather than removing the rows with missing values, data imputation can be performed in the first instance, filling the gaps with the mean of the observed ages. This ensures as much of the available data as possible can be used, without losing information that could be modelled upon.
titan_train['Age'].fillna(titan_train['Age'].mean().round(0), inplace = True) # Impute missing ages with the (rounded) mean age
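A refinement worth considering (a sketch on hypothetical sample data, not applied in this notebook) is group-wise imputation: filling each missing Age with the mean age of that passenger's class, which tends to be less distorting than one global mean because age varies by class.

```python
import pandas as pd

# Hypothetical sample: Age missing for one passenger in each class.
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age":    [40.0, None, 20.0, None],
})

# Fill each missing Age with the mean Age of that passenger's class.
df["Age"] = df.groupby("Pclass")["Age"].transform(lambda s: s.fillna(s.mean()))
print(df["Age"].tolist())  # [40.0, 40.0, 20.0, 20.0]
```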
Two attributes in the dataset appear similar in nature: SibSp represents the number of siblings or spouses aboard with the passenger, whilst Parch describes the number of parents or children aboard. These two attributes can therefore be combined into one feature, family size, to reduce dimensionality slightly. One is added to the result so that a passenger travelling alone has a family size of one.
titan_train['Family_size'] = titan_train['SibSp'] + titan_train['Parch'] + 1 # Sums the two attributes together, plus one to represent a passenger traveling alone.
titan_train.drop(columns=['SibSp','Parch'],inplace = True) # Remove SibSp and Parch columns
The second-to-last preprocessing step concerns the Embarked column, which represents the port where the passenger boarded the Titanic: Southampton in the UK (S), Cherbourg in Normandy (C), or Queenstown, now known as Cobh, in Ireland (Q). In its current form, most models cannot take these categorical values as they are, but through one-hot encoding once again, this can be resolved.
embarked_dummies = pd.get_dummies(titan_train.Embarked) # Perform One-hot encoding on Embarked Column
titan_train['Emb_Southampton'] = embarked_dummies['S'] # Column for passengers who embarked from Southampton
titan_train['Emb_Cherbourg'] = embarked_dummies['C'] # Column for passengers who embarked from Cherbourg
titan_train['Emb_Queenstown'] = embarked_dummies['Q'] # Column for passengers who embarked from Queenstown
titan_train.drop(columns='Embarked', inplace = True)
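One design point about one-hot encoding, sketched below on a hypothetical mini-sample: with k categories, only k-1 dummy columns are strictly needed, since the dropped category is implied when all remaining dummies are 0. Keeping all k columns introduces perfect multicollinearity (the "dummy variable trap"), which can matter for linear models such as the logistic regression used later.

```python
import pandas as pd

# Hypothetical sample of embarkation ports.
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# drop_first=True drops the alphabetically first category ('C'),
# leaving k-1 columns; 'C' is implied when both 'Q' and 'S' are 0.
dummies = pd.get_dummies(ports, drop_first=True)
print(list(dummies.columns))  # ['Q', 'S']
```

Keeping all three columns, as this notebook does, is still workable for the models used here; dropping one is simply the more conventional choice for linear models.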
The last step in preprocessing is to scale all input values to standardise them.
sc = StandardScaler() # Prepares for scaling to commence
X = sc.fit_transform(titan_train) # Scales the training data.
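To illustrate what `StandardScaler` does, and why the scaler should be fitted on training data only: a minimal sketch on synthetic stand-in data (not the Titanic features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(100, 2))  # stand-in training matrix
test = rng.normal(loc=50, scale=10, size=(20, 2))    # stand-in unseen matrix

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training data only
test_scaled = scaler.transform(test)        # reuse those statistics on new data

# Each training column now has mean 0 and standard deviation 1.
print(np.allclose(train_scaled.mean(axis=0), 0))  # True
print(np.allclose(train_scaled.std(axis=0), 1))   # True
```

Fitting a fresh scaler on the test data instead would leak test-set statistics and put the two datasets on subtly different scales.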
Now that all the preprocessing is complete, the data from the train.csv file can be split into training and testing datasets to run the three ML models against. Note that the data from test.csv is used to perform predictions on data the model(s) have not seen previously. The train.csv data will be split with 67% used for training the models and 33% for testing model performance.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size= 0.33,random_state = 27)
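One refinement worth knowing about (a sketch on hypothetical synthetic labels, not applied above): passing `stratify=y` to `train_test_split` keeps the class ratio identical in both splits, which matters on imbalanced targets like Survived.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced label vector: 70% zeros, 30% ones.
y = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y preserves the 70/30 class ratio in both splits.
_, _, y_tr, y_te = train_test_split(X_demo, y, test_size=0.3,
                                    random_state=27, stratify=y)
print(y_tr.mean(), y_te.mean())  # both ≈ 0.3
```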
For the K-NN model, a loop will iterate over values of k from 1 to 20, to see which number is the most accurate choice for k.
knn_results = pd.DataFrame()
accuracy = []
confusion_mat = []
k = []
class_rpt = []
# Loop through k between 1-20 finding best k to based on accuracy
#------------------------------------------------------------
for i in range(1,21):
    knn = KNeighborsClassifier(n_neighbors= i)
    knn.fit(X_train,Y_train)
    knn_y_pred = knn.predict(X_test)
    k.append(i)
    accuracy.append(accuracy_score(Y_test,knn_y_pred))
    confusion_mat.append(confusion_matrix(Y_test,knn_y_pred))
    class_rpt.append(classification_report(Y_test,knn_y_pred,output_dict=True))
# Place acquired metrics into summary table
# ----------------------------------------------------------
knn_results['k'] = k
knn_results['accuracy'] = accuracy
knn_results['confusion_matrix'] = confusion_mat
# Sort dataframe to show highest accuracy score on top
#-----------------------------------------------------------
knn_results = knn_results.sort_values(by = ['accuracy'],ascending = False)
Now that the above code has executed, the results have been stored in a dataframe showing each k used, along with its accuracy score and confusion matrix.
knn_results
| | k | accuracy | confusion_matrix |
|---|---|---|---|
| 5 | 6 | 0.844068 | [[173, 14], [32, 76]] |
| 9 | 10 | 0.837288 | [[170, 17], [31, 77]] |
| 15 | 16 | 0.837288 | [[170, 17], [31, 77]] |
| 19 | 20 | 0.833898 | [[168, 19], [30, 78]] |
| 6 | 7 | 0.833898 | [[166, 21], [28, 80]] |
| 7 | 8 | 0.833898 | [[170, 17], [32, 76]] |
| 13 | 14 | 0.833898 | [[171, 16], [33, 75]] |
| 17 | 18 | 0.830508 | [[167, 20], [30, 78]] |
| 16 | 17 | 0.830508 | [[166, 21], [29, 79]] |
| 8 | 9 | 0.830508 | [[165, 22], [28, 80]] |
| 12 | 13 | 0.827119 | [[164, 23], [28, 80]] |
| 18 | 19 | 0.827119 | [[165, 22], [29, 79]] |
| 10 | 11 | 0.827119 | [[163, 24], [27, 81]] |
| 3 | 4 | 0.827119 | [[171, 16], [35, 73]] |
| 11 | 12 | 0.823729 | [[168, 19], [33, 75]] |
| 14 | 15 | 0.823729 | [[164, 23], [29, 79]] |
| 4 | 5 | 0.823729 | [[163, 24], [28, 80]] |
| 2 | 3 | 0.820339 | [[161, 26], [27, 81]] |
| 0 | 1 | 0.776271 | [[155, 32], [34, 74]] |
| 1 | 2 | 0.769492 | [[174, 13], [55, 53]] |
Based on the iteration above, k = 6 appears to be the optimal value to use. In light of this, that configuration can now be applied to build the final K-NN model.
KNN = KNeighborsClassifier(n_neighbors= 6)
KNN.fit(X_train,Y_train)
KNN_y_pred = KNN.predict(X_test)
print(confusion_matrix(Y_test,KNN_y_pred))
print(accuracy_score(Y_test,KNN_y_pred))
print(classification_report(Y_test,KNN_y_pred))
[[173 14]
[ 32 76]]
0.8440677966101695
precision recall f1-score support
0 0.84 0.93 0.88 187
1 0.84 0.70 0.77 108
accuracy 0.84 295
macro avg 0.84 0.81 0.83 295
weighted avg 0.84 0.84 0.84 295
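One caveat with the k search above is that it scores each k on a single train/test split, which can be noisy. A hedged alternative sketch, using 5-fold cross-validation on synthetic stand-in data (`X_demo`/`y_demo` are placeholders, not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=27)

# Mean 5-fold accuracy per k: averaging over folds gives a less noisy
# estimate of each k's performance than one fixed split.
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                       X_demo, y_demo, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```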
Now to build a model and show the performance on the training data using Logistic Regression.
regr = LogisticRegression(solver='liblinear', random_state=1)
regr.fit(X_train,Y_train)
log_y_pred = regr.predict(X_test)
print(confusion_matrix(Y_test,log_y_pred))
print("Accuracy on test data : ",accuracy_score(Y_test,log_y_pred)) # Scored on the held-out split, not the training data
print("AUC score : " , roc_auc_score(Y_test,regr.predict_proba(X_test)[:, 1]))
print(classification_report(Y_test,log_y_pred))
[[160 27]
 [ 33 75]]
Accuracy on test data :  0.7966101694915254
AUC score :  0.852619330560507
precision recall f1-score support
0 0.83 0.86 0.84 187
1 0.74 0.69 0.71 108
accuracy 0.80 295
macro avg 0.78 0.78 0.78 295
weighted avg 0.79 0.80 0.80 295
Based on the predictions of the logistic regression model, the accuracy score comes in at approximately 79.66%, with an AUC score of approximately 85.26%.
fpr, tpr, thresholds = roc_curve(Y_test, regr.predict_proba(X_test)[:, 1]) # ROC needs probability scores, not hard class labels
fig = px.area(
x=fpr, y=tpr,
title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
labels=dict(x='False Positive Rate', y='True Positive Rate'),
width=700, height=500
)
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
Finally, a model is created on the data using a Support Vector Machine (SVM).
clf_svm = SVC(gamma='auto')
clf_svm.fit(X_train,Y_train)
svm_y_pred = clf_svm.predict(X_test)
print(confusion_matrix(Y_test,svm_y_pred))
print("Accuracy on test data : ",accuracy_score(Y_test,svm_y_pred)) # Scored on the held-out split, not the training data
print(classification_report(Y_test,svm_y_pred))
[[166 21]
 [ 28 80]]
Accuracy on test data :  0.8338983050847457
precision recall f1-score support
0 0.86 0.89 0.87 187
1 0.79 0.74 0.77 108
accuracy 0.83 295
macro avg 0.82 0.81 0.82 295
weighted avg 0.83 0.83 0.83 295
fpr, tpr, thresholds = roc_curve(Y_test, clf_svm.decision_function(X_test)) # Use decision scores; SVC has no predict_proba unless probability=True
fig = px.area(
x=fpr, y=tpr,
title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
labels=dict(x='False Positive Rate', y='True Positive Rate'),
width=700, height=500
)
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
Now that the models have been created against the training data from the Titanic dataset, they need to be applied to the test data, as if the models were seeing new data. Before that can commence, the preprocessing steps need to be repeated against the new data (placed into a function this time for ease).
def preprocess(df, scaler):
    df.drop(columns = ['PassengerId','Ticket','Name'], inplace = True) # Removal of unneeded columns
    sex_dummy_te = pd.get_dummies(df.Sex) # One-hot encoding on Sex column
    df['Gender'] = sex_dummy_te.male # Add converted Sex column back
    df.drop(columns = 'Sex', inplace = True) # Remove Sex column
    df['Cabin'].fillna(0, inplace = True) # For any Cabin values that are NULL/NaN, replace with 0
    df['Cabin'] = df['Cabin'].apply(lambda x: 1 if x != 0 else x) # If the Cabin value is not 0, then assign it as 1
    df['Age'].fillna(df['Age'].mean().round(0), inplace = True) # Impute missing ages with the (rounded) mean age
    df['Fare'].fillna(df['Fare'].mean(), inplace = True) # Fare can also be missing in unseen data; impute so scaling does not propagate NaN
    df['Family_size'] = df['SibSp'] + df['Parch'] + 1 # Sums the two attributes together, plus one to represent a passenger travelling alone
    df.drop(columns=['SibSp','Parch'], inplace = True) # Remove SibSp and Parch columns
    embarked_dummies = pd.get_dummies(df.Embarked) # Perform one-hot encoding on Embarked column
    df['Emb_Southampton'] = embarked_dummies['S'] # Column for passengers who embarked from Southampton
    df['Emb_Cherbourg'] = embarked_dummies['C'] # Column for passengers who embarked from Cherbourg
    df['Emb_Queenstown'] = embarked_dummies['Q'] # Column for passengers who embarked from Queenstown
    df.drop(columns='Embarked', inplace = True) # Drop Embarked column
    df = scaler.transform(df) # Reuse the scaler fitted on the training data; refitting here would put the test data on a different scale
    return df
new_data = preprocess(new_titanic, sc) # Pass in the scaler fitted earlier on the training data
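A more robust pattern for keeping train and test preprocessing in sync is to bundle the transformations and the model into a single scikit-learn `Pipeline` with a `ColumnTransformer`, so fitted statistics (means, scales, dummy categories) are learned once from the training data and reused automatically. A sketch, assuming a simplified subset of the raw Titanic column names used in this notebook:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simplified column subset for illustration.
numeric = ["Age", "Fare", "Pclass"]
categorical = ["Sex", "Embarked"]

pre = ColumnTransformer([
    # Numeric columns: impute the mean, then standardise.
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    # Categorical columns: impute the mode, then one-hot encode;
    # handle_unknown="ignore" tolerates unseen categories at predict time.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])
# model.fit(train_df[numeric + categorical], train_df["Survived"])
# model.predict(test_df[numeric + categorical])  # fitted stats are reused
```

With this design there is no hand-written `preprocess` function to keep consistent between the two datasets: calling `fit` learns everything from train.csv, and `predict` applies identical transformations to test.csv.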